May 14th 2020

Introduction to the study

  • Original data
## # A tibble: 242 x 100
##   Snake Reference Note  `SVMP (Snake Ve… `PI-SVMP (Snake… `PII-SVMP (Snak…
##   <chr> <chr>     <chr>            <dbl>            <dbl>            <dbl>
## 1 Agki… https://… Mexi…             24.5                0                0
## 2 Agki… https://… Cost…             30.8                0                0
## 3 Agki… https://… Mexi…             30.6                0                0
## 4 Agki… https://… Orig…             32.5                0                0
## # … with 238 more rows, and 94 more variables: `PIII-SVMP (Snake Venom
## #   Metalloproteinase PIII), %` <dbl>, …
  • Additional data
## # A tibble: 27 x 4
##   Toxin               `Vipera aspis asp… `Vipera berus ber… `Vipera anatolica s…
##   <chr>                            <dbl>              <dbl>                <dbl>
## 1 SVMP (Snake Venom …               13.4                 NA                 42.9
## 2 3Ftx (three-finger…               NA                   NA                 NA  
## 3 Unknown peptides                  NA                   NA                 23.5
## 4 PLA2 (Phospholipas…               30.9                 NA                  8.2
## # … with 23 more rows
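
The additional data stores one toxin per row and one snake per column, whereas the original data stores one snake per row. A minimal sketch of the reshaping needed before the two tables can be merged is given here. It uses base R so that it runs without extra packages (the project itself used the tidyverse). The snake names `snake_a` and `snake_b` are placeholders, and treating a missing value as "toxin not detected" is an assumption, not something stated in the slides.

```r
# Toy stand-in for the additional data: one row per toxin, one column per snake.
# The numbers are taken from the preview above; the snake names are placeholders.
additional <- data.frame(
  Toxin   = c("SVMP", "3Ftx", "PLA2"),
  snake_a = c(13.4, NA, 30.9),
  snake_b = c(42.9, NA, 8.2),
  stringsAsFactors = FALSE
)

# Transpose so that each snake becomes a row and each toxin a column.
toxins <- t(as.matrix(additional[, -1]))
colnames(toxins) <- additional$Toxin

# Assumption: a missing value means the toxin was not detected, so it becomes 0.
toxins[is.na(toxins)] <- 0

# Rebuild a data frame with the snake name as an ordinary column.
additional_by_snake <- data.frame(
  Snake = rownames(toxins),
  toxins,
  row.names = NULL,
  check.names = FALSE  # keep column names such as "3Ftx" unchanged
)
```

After this step the table has the same orientation as the original data, so rows can be appended once the column names agree.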

Goal of study

  • Perform visual analysis of data distribution
  • Develop a tool for venom composition analysis
  • Group snakes by family based on venom composition (PCA, K-means, ANN)

Materials and methods

  • Data processing, modelling and the creation of this presentation were performed in RStudio Cloud.

  • Coding followed the tidyverse style guide by Hadley Wickham.

  • The modelling with Artificial Neural Networks was performed in a separate project.

  • The whole project is available on GitHub at: https://github.com/rforbiodatascience/2020_group04

Packages used: httr, readxl, tidyverse, knitr, plotly, maps, patchwork, shiny, rsconnect, keras, devtools

Project outline

  • Loading and cleaning data
    • Map locations to country
  • Augmentation of data
    • Merge datasets
    • Create genus and species columns
    • Group toxins
  • Analysis and visualisations
    • Geographical and genus distribution
    • Venom composition analysis
  • Unsupervised analysis
    • PCA
    • K-means clustering
  • Supervised classification model
    • Artificial Neural Network (ANN)

Tidying and transforming data

  • Tidy raw data
    • Load and clean data
  • Transform data
    • Join new data
    • Group toxins
    • Remove toxins found in fewer than five snakes
    • Map genus to snake family
## # A tibble: 233 x 37
##   Snake Genus Species Family Country Reference SVMPi `DC-fragment` CRISP   PLB
##   <chr> <chr> <chr>   <chr>  <chr>   <chr>     <dbl>         <dbl> <dbl> <dbl>
## 1 Agki… Agki… biline… Viper… Mexico  https://…     0           0    0        0
## 2 Agki… Agki… biline… Viper… Costa … https://…     0           0    0        0
## 3 Agki… Agki… biline… Viper… Mexico  https://…     0           0    5.6      0
## 4 Agki… Agki… contor… Viper… Unknown https://…     0           0.1  3.7      0
## 5 Agki… Agki… contor… Viper… USA     https://…     0           0    1.96     0
## 6 Agki… Agki… contor… Viper… USA     https://…     0           0    0        0
## 7 Agki… Agki… contor… Viper… USA     https://…     0           0    1.9      0
## # … with 226 more rows, and 27 more variables: Crotoxin <dbl>, …
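
The transformation steps listed above (splitting the snake name into genus and species, mapping genus to family, and removing rare toxins) can be sketched as follows. This is an illustration in base R, not the project's actual code. The abundance values are made up, the toy table has only three toxins, and the variable names (`venom`, `family_of`, `min_snakes`) are hypothetical.

```r
# Hypothetical toy data: one row per snake, toxin abundances in percent.
venom <- data.frame(
  Snake = c("Agkistrodon bilineatus", "Agkistrodon contortrix",
            "Bungarus candidus", "Daboia russelii russelii",
            "Vipera berus", "Naja naja"),
  SVMP  = c(24.5, 30.8, 0, 12.0, 20.1, 1.5),
  PLA2  = c(10.2, 30.9, 25.0, 32.0, 8.2, 12.0),
  CRISP = c(0, 0, 0, 5.6, 0, 0),
  stringsAsFactors = FALSE
)

# Split the name: the first word is the genus, the second the species.
name_parts    <- strsplit(venom$Snake, " ", fixed = TRUE)
venom$Genus   <- vapply(name_parts, `[`, character(1), 1)
venom$Species <- vapply(name_parts, `[`, character(1), 2)

# Map genus to family with a named lookup vector.
family_of <- c(Agkistrodon = "Viperidae", Daboia = "Viperidae",
               Vipera = "Viperidae", Bungarus = "Elapidae", Naja = "Elapidae")
venom$Family <- unname(family_of[venom$Genus])

# Count, for each toxin, the snakes in which it has a non-zero abundance.
toxin_cols  <- c("SVMP", "PLA2", "CRISP")
snakes_with <- colSums(venom[toxin_cols] > 0)

# Keep only toxins found in at least `min_snakes` snakes (five in the slides).
min_snakes <- 5
kept       <- names(snakes_with)[snakes_with >= min_snakes]
venom_kept <- venom[c("Snake", "Genus", "Species", "Family", kept)]
```

In this toy table `CRISP` is present in a single snake and is therefore dropped, while `SVMP` and `PLA2` are kept.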

Analysis and visualisations

Geographical overview of samples

Snakes from richer countries, or from countries with a focus on snake research, are overrepresented.

Genus distribution according to family

Venom composition in snake families

Toxin abundances

Comparing venom composition between species

Comparing venom composition within species

Shiny app

Unsupervised and supervised learning

Results from PCA and K-means
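
As an illustration of the unsupervised workflow, the sketch below runs a PCA followed by K-means with the base R functions `prcomp()` and `kmeans()`. The data are simulated, not the venom data. Scaling the columns, clustering on the first two principal components, and using two clusters are all assumptions made for this example and may differ from the settings used in the project.

```r
set.seed(1)  # k-means starts from random centres, so fix the seed

# Simulated data: two artificial groups of ten "snakes" with three toxin abundances.
toxins <- rbind(
  cbind(SVMP = rnorm(10, 40, 3),  PLA2 = rnorm(10, 20, 3), `3FTx` = rnorm(10, 1, 0.5)),
  cbind(SVMP = rnorm(10, 2, 0.5), PLA2 = rnorm(10, 25, 3), `3FTx` = rnorm(10, 50, 3))
)
group <- rep(c("A", "B"), each = 10)

# PCA on centred columns scaled to unit variance.
pca <- prcomp(toxins, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
explained <- pca$sdev^2 / sum(pca$sdev^2)

# K-means on the first two principal components, with several random starts.
clusters <- kmeans(pca$x[, 1:2], centers = 2, nstart = 10)

# Cross-tabulate the known group against the assigned cluster.
print(table(group, cluster = clusters$cluster))
```

Samples that fall into a cluster dominated by another group are the ones worth inspecting individually.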

Prediction model based on venom composition

A simple vanilla ANN achieved a classification accuracy of 98.2 %.

  • Specifications: 4 hidden neurons, test set size 25 %, validation set size 20 %, learning rate 0.005, 100 epochs, loss criterion: binary cross-entropy.
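
How the test and validation fractions partition the data can be sketched in base R as below. Two things are assumptions here: that the model was trained on the 233 rows of the cleaned table, and that the 20 % validation set is taken from the data remaining after the test set has been held out. The network itself is not reproduced.

```r
set.seed(42)  # make the random split reproducible

n         <- 233   # assumption: number of rows in the cleaned table
test_frac <- 0.25  # test set size from the specifications
val_frac  <- 0.20  # validation set size, assumed to apply to the non-test rows

# Shuffle the row indices once, then cut the shuffled vector into three parts.
idx <- sample(n)

n_test   <- round(n * test_frac)
test_idx <- idx[seq_len(n_test)]
rest     <- idx[-seq_len(n_test)]

n_val     <- round(length(rest) * val_frac)
val_idx   <- rest[seq_len(n_val)]
train_idx <- rest[-seq_len(n_val)]

# The three index sets are disjoint and together cover every row exactly once.
split_sizes <- c(train = length(train_idx),
                 validation = length(val_idx),
                 test = length(test_idx))
print(split_sizes)
```

Because the test rows are set aside before any training, the reported accuracy reflects snakes the model has not seen.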

Theoretical analysis of incorrect labels

The single misclassified snake was incorrectly labeled as Elapidae.

Analysis of special cases

Snake incorrectly labeled by the ANN:

## # A tibble: 1 x 2
##   Snake                    Family   
##   <chr>                    <chr>    
## 1 Daboia russelii russelii Viperidae

Snake from K-means cluster 2:

## # A tibble: 1 x 2
##   Snake             Family  
##   <chr>             <chr>   
## 1 Bungarus candidus Elapidae

Shiny app

Static plots for publication
